This vignette introduces scBubbletree, a transparent methodology for single cell RNA-seq data exploration based on well established methods clustering and visualization. We will demonstrate the functionality of scBubbletree by analyzing three scRNA-seq datasets (Case studies A, B and C), while also showing the user how to integrate scBubbletree with existing pipelines for scRNA-seq analysis e.g. based on Seurat.

To run this vignette we need several R-packages. Load them now:

source(file = "~/Rutil/Init_Rpack.R")
source(file = "~/Rutil/Graphics.R")
source(file = "~/Rutil/Stats.R")

# source(file = "src/ClusteringUtil.R")
source(file = "../scBubbletree/R/util.R")
library(farver, lib.loc = "/usr/local/lib/R/site-library")

library(Seurat, lib.loc = lib.loc)
library(ggplot2, lib.loc = lib.loc)
library(reshape2, lib.loc = lib.loc)
library(parallel, lib.loc = lib.loc)
library(ggtree, lib.loc = lib.loc)
library(cluster, lib.loc = lib.loc)
library(ape, lib.loc = lib.loc)
library(org.Hs.eg.db, lib.loc = lib.loc)
library(bluster, lib.loc = lib.loc)
library(SummarizedExperiment, lib.loc = lib.loc)

Case study A: 3,918 cells from 5 cancer cell lines 1

Data

In this case study we will analyze scRNA-seq from 3918 cancer cells derived from 5 cell lines: H2228, H1975, HCC827, H838 and A549

Cell types (ground truth) were inferred for each cell on the basis of known genetic variation with demuxlet. We will use this simple dataset to demonstrate the advantages of quantitative scRNA-seq data exploration with scBubbletree in combination with Seurat preprocessing.

Load the data and perform some basic scRNA-seq data processing steps with Seurat:

# Lets load the benchmark data
load(file = "raw_data/Tian_2019/sc_mixology-master/data/sincell_with_class_5cl.RData")


# We are only interested in the 10x data object 'sce_sc_10x_5cl_qc'
d <- sce_sc_10x_5cl_qc


# Remove the remaining objects
rm(sc_Celseq2_5cl_p1, sc_Celseq2_5cl_p2, sc_Celseq2_5cl_p3, sce_sc_10x_5cl_qc)


# Get the meta data for each cell
meta <- colData(d)[, c("cell_line", "cell_line_demuxlet", "demuxlet_cls")]


# Create Seurat object from the raw counts and append the meta data to it
d <- Seurat::CreateSeuratObject(counts = d@assays$data$counts)
# check if all cells are matched between d and meta
# table(rownames(d@meta.data) == meta@rownames) 
d@meta.data <- cbind(d@meta.data, meta@listData)


# cell type predictions are provided in the meta data
table(d@meta.data$cell_line_demuxlet)


# Preprocessing with Seurat: SCT transformation + PCA + UMAP 
d <- SCTransform(object = d)
d <- RunPCA(object = d, npcs = 20, features = VariableFeatures(object = d))
d <- RunUMAP(d, dims = 1:20)

The generated 2D UMAP shown below (left panel) contains 5 to 8 clusters of cells. After we color-code the cells by their predicted cell lines (right panel) we see 5 clusters, with some substructure also visible within the cell lines. This is a toy dataset composed of distinct cell lines, which makes the 2D UMAP simple to interpret. In more realistic scenarios, such as analyses of scRNA-seq data derived from PBMCs, we typically see more complex 2D UMAPs composed of cell clusters that are not clearly separated. Furthermore, in these analyses we will not have ground truth, and cell lines or cell types will have to be inferred based on different marker genes.

scBubbletree approach

Now lets turn to scBubbletree. As first input scBubbletree uses matrix \(A^{n\times f}\) which represents a low-dimensional feature space of the original scRNA-seq data, with \(n\) rows as cells and \(f\) columns as low- dimension features. We will use the PCA data generated by Seurat as \(A\).

# This is the main input of scBubbletree -> matrix A
A <- d@reductions$pca@cell.embeddings

# A has n=cells as rows, f=features as columns (e.g. from PCA)
dim(A)
FALSE [1] 3918   20

The scBubbletree algorithm performs four main operations:

  1. clustering with k-means
  2. hierarchical organization of clusters (bubbles)
  3. annotation
  4. visualization

As scBubbletree uses k-means clustering of \(A\) to identify clusters of transcriptionally similar cells, our first goal is to determine k, i.e. the number of clusters in the data. This can be achieved with the function get_k.

get_k performs k-means clustering B times (bootstrapping iterations) for a vector of k values specified by the parameter ks. For this we draw a subset of \(A\) by sampling a proportion (cv_clust_p) of \(A\) with replacement. get_k then computes three metrics for each k and B which allow us to find the optimal k:

  • silhouette index
  • gap statistic
  • within sum of squares (WSS) for elbow method

The remaining parameters of get_k are used to tune the algorithm and are explained below:

  • n_start and iter_max: used to tune k-means (see ?kmeans)
  • cores: computer cores to be used
  • cv_gap_p proportion of cells to be used for the computation of the gap statistic. For small samples (e.g. < 20,000 cells), we suggest that you use the default parameters. For larger samples even low cv_gap_p=0.1 produces robust results. Small B (e.g. B=5) might be necessary to execute the function in a less than one hour for large datasets.

Lets run get_k now.

b <- get_k(B = 20, 
           cv_clust_p = 1, 
           cv_gap_p = 1, 
           ks = 2:15,
           x = A,
           n_start = 10, 
           iter_max = 50, 
           cores = 20)
FALSE boot: 1  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 2  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 3  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 4  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 5  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 6  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 7  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 8  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 9  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 10  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 11  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 12  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 13  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 14  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 15  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 16  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 17  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 18  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 19  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 20  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS.

The results of get_k are shown below. Points in these plots are results from each bootstrap iteration. We used B=20, hence for each k we have 20 points.

k is is easily discernible from each of the three metrics (panels).

The silhouette index peaks around k=5 (left panel) and for larger ks the silhouette index starts to drop and is numerically less stable. Hence, we can conclude that k values between 5 and 8 seem reasonable for this dataset, which is consistent with the 2D UMAP structure shown before. This conclusion is confirmed by the gap statistic (middle panel) and the elbow method based on WSS (right panel), i.e. we see a kinks around k=5 in the corresponding plots.

Lets start with the simplest model k=5 and use scBubbletree to perform clustering and use the clustering data to generate a dendrogram, showing the clusters as bubbles (tree leaves) in the dendrogram.

# Perform clustering to get data for scBubbletree
btd_k5 <- get_bubbletree_data(x = A,
                              k = 5,
                              seed = 1234,
                              cores = 1,
                              B = 50,
                              N_eff = 100,
                              verbose = F,
                              n_start = 100,
                              iter_max = 100,
                              round_digits = 2,
                              show_branch_support = T)

Lets now plot the dendrogram:

btd_k5$tree

Bubbles

The generated dendrogram has k=5 bubbles (clusters) shown as tree leaves. The radius of each bubble is scaled as function of the the number of cells that belong to the corresponding cluster. Analogously, the bubbles are color-coded according to their sizes, i.e. dark bubbles are larger and have many cells, and bright bubbles are small and contain few cells. The absolute and relative cell frequencies in the different bubbles are shown as labels.

Bubble 3 is the largest (and darkest) one in the dendrogram and contains 1,253 cells (\(\approx\) 32% of all cells in the dataset). Bubble 5 is the smallest one (and brightest) and contains only 436 cells (\(\approx\) 11% of all cells in the dataset).

Dendrogram structure

The average distances between a pair of bubbles are represented by the sums of branch length in the dendrogram. We can get a summary of the distances between pairs of clusters from the btd_k5 object by invoking the following command:

# c_i, c_j = pair of clusters/bubbles
# M = mean Euclidean distance (from B bootstraps)
# L90, H90 = lower and upper bound of the 90% Highest Density Interval of M;
knitr::kable(btd_k5$dend_dist$hc_pair_dist, digits = 1)
c_i c_j M L90 H90
1 1 0.0 0.0 0.0
1 2 87.8 86.7 88.9
1 3 86.2 84.6 88.2
1 4 85.9 84.2 88.2
1 5 85.9 84.2 88.2
2 1 87.8 86.7 88.9
2 2 0.0 0.0 0.0
2 3 86.2 82.3 88.9
2 4 87.0 84.8 88.9
2 5 87.0 84.8 88.9
3 1 86.2 84.6 88.2
3 2 86.2 82.3 88.9
3 3 0.0 0.0 0.0
3 4 83.7 81.7 86.7
3 5 83.7 81.7 86.7
4 1 85.9 84.2 88.2
4 2 87.0 84.8 88.9
4 3 83.7 81.7 86.7
4 4 0.0 0.0 0.0
4 5 74.1 72.8 75.1
5 1 85.9 84.2 88.2
5 2 87.0 84.8 88.9
5 3 83.7 81.7 86.7
5 4 74.1 72.8 75.1
5 5 0.0 0.0 0.0

We see similar distances between the pairs of bubbles, and low degree of structure in the dendrogram. This makes sense as the data is composed of cells derived from from 5 distinct cell lines. We will see in case study B, in which we will inspect real dataset of PBMCs, that cell types in real datasets are structured, i.e. we will see branches in the dendrogram formed by transcriptionally similar groups of cells (e.g. CD4 and CD8 T-cells).

We can decide to show support for branches in the dendrogram by setting the parameter show_branch_support = TRUE in the function get_bubbletree. This will produce labels showing the support for each branch in the dendrogram among B generated dendrograms during the bootstrapping operation with B bootstrapping iterations. We call get_bubbletree_data with B=50.

Can we verify that this clustering is sensible?

Categorical annotations

For this scBubbletree offers a rich set of annotation add-ons. For instance, for each cell we have the predicted cell types. We can generate a tile-plot showing us the proportions of each cell type in the five bubbles.

For categorical data we use the function get_annotation_tiles_char:

w1 <- get_annotation_tiles_char(k = btd_k5$cluster,
                                a = d@meta.data$cell_line_demuxlet,
                                tree_meta = btd_k5$tree_meta,
                                integrate_over_clusters = F)

scBubbletree uses the R-package patchwork to combine plots. Lets merge the dendrogram with the newly generated annotations:

(btd_k5$tree|w1$w)+patchwork::plot_layout(ncol = 2, widths = c(1, 1))

This plot tells us that cells with each annotation are exclusively (>99%) found in a common bubble, i.e. columns integrate to 100%. For instance, 99.76% of the A549 cells are found in bubble 3, 99.09% of the H1975 cells are found in bubble 5. Few cells from a given cell type are mixed in different bubbles. This mixing is also present in the 2D UMAP, however, doe to the heavy overplotting they are not visible. Our approach is more transparent in this sense, and show in a quantitative way the clustering output.

We might also be curious to check the homogeneity of individual bubbles. For this we will draw a similar plot but make sure that the rows integrate to 100%.

The annotation plot below informs us that 100% of the cells in bubble 5 belong to cell type H1975. About 0.17% of the 593 cells in bubble 4 belong to cell A549. This is equal to 0.17/100*593 = 1 cell, while the majority of cells 98.99 belong to cell HCC827. Hence, we see quite high homogeneity within the bubbles.

w2 <- get_annotation_tiles_char(k = btd_k5$cluster,
                                a = d@meta.data$cell_line_demuxlet,
                                tree_meta = btd_k5$tree_meta,
                                integrate_over_clusters = TRUE) # <- changed

(btd_k5$tree|w2$w)+patchwork::plot_layout(ncol = 2, widths = c(1, 1))

Gini index

Using the Gini impurity index we can quantify the goodness of the clustering with k=5. This functionality is implemented by the function get_gini.

FALSE $cluster_gini
FALSE           4           2           5           1           3 
FALSE 0.020099588 0.004563592 0.000000000 0.010474495 0.000000000 
FALSE 
FALSE $total_gini
FALSE [1] 0.006095786

Both the total Gini index and the cluster-specific Gini indices are close to 0, indicating accurate clustering of cell types as most bubbles consist of cells that belong to a single cell line. This is a quantitative way of summarizing the tile plot visualization shown earlier.

Our approach can integrate the results obtained by function get_k with the cell type predictions (labels), and show quantitatively the change in Gini index as a function of k. This is done with the function get_gini_boot. In light of these cell type predictions, we can conclude that at k=5 most bubbles in the bubble tree are pure and contain cells from distinct cell types.

gini_boot <- get_gini_boot(labels = d@meta.data$cell_line_demuxlet,
                           kmeans_boot_obj = b)
ggplot(data = gini_boot$total_gini)+
  geom_jitter(aes(x = k, y = total_gini),
              size = 1, shape = 21, width = 0.1, height = 0)

Quantitative annotations

scBubbletree also implements add-ons for visualization of quantitative cell properties, such as gene expression. Lets visualize the expression of five marker genes of the five cancer cell lines. We can do this in two ways. First, we can show the average marker gene expression in each bubble for each marker as a tile plot with get_annotation_tiles_num.

Lets invoke this function now.

# First we need to select gene expressions for each cell and 
# also for five marker genes
as <- as.matrix(t(d@assays$SCT@data[
  rownames(d@assays$SCT@data) %in% 
    c("ALDH1A1", "OS9", "PEG10", "S100A9", "SLPI"), ]))

# 'as' is a matrix with n=rows for cells and a=columns for 
# annotations (genes). The column names will be shown in
# the plot.

# We will order the columns in 'as' in the same way we want
# them to be plotted. These genes are markers for: A549, 
# HCC827, H1975, H2228 and H838
as <- as[, c("ALDH1A1", "OS9", "PEG10", "S100A9", "SLPI")]
# This function build the nummeric annotations plot
w3 <- get_annotation_tiles_num(k = btd_k5$cluster,
                               as = as,
                               tree_meta = btd_k5$tree_meta,
                               plot_title = "")

# Plot
(btd_k5$tree|w3$w)+patchwork::plot_layout(widths = c(1, 1))

Second, we can visualize the distribution of each marker gene in each bubble using violin plots with get_annotation_violins. This function uses the same input as get_annotation_tiles_num.

Lets invoke this function now.

w4 <- get_annotation_violins(k = btd_k5$cluster,
                             as = as,
                             tree_meta = btd_k5$tree_meta,
                             plot_title = "",
                             scales = 'free_x')

(btd_k5$tree|w3$w|w4$w)+patchwork::plot_layout(widths = c(1, 1, 2))

Compositional visualization

scBubbletree uses the R-package patchwork to combined figures. This makes it convenient for users to ‘attach’ their own plots to the results from this package. Lets combine the UMAP plots shown earlier with the output of scBubbletree.

(((btd_k5$tree|w3$w|w4$w)+patchwork::plot_layout(widths = c(1, 1, 2)))/
  ((w1$w|u)+patchwork::plot_layout(widths = c(1, 3))))

Summary of the approach

scBubbletree is intended to promote transparent analysis of scRNA-seq. The user of scBubbletree is encouraged to try different values of k and at each stage perform bubble/cluster annotation. From this, the user has to provide answer to the following questions:

  1. is the selected k biologically justifiable?
  2. is the structure of the generated dendrogram biologically meaningful? For instance:
    • are relative sizes of bubble (cell-type) consistent with prior knowledge?
    • are bubble distances in the dendrogram smaller between similar cell types?
    • can we predict the cell types of the cells in each bubble based on e.g. marker gene expression
  3. if we increased k, could we prove that the newly created bubbles represent new cell types using categorical or quantitative (e.g. ) annotations?

We will provide answer to these questions in Case study B.

Case study B: 53,000 PBMCs from 8 healthy donors 2,3

Data

In this case study we will analyze scRNA-seq dataset of approximately 53,000 PBMCs derived from 8 healthy donors. This represents a typical scenario that most user of scBubbletree will encounter. In addition to gene expression data, this dataset reports for each cell a cell type prediction at three levels of resolution (l1, l2 and l3). The cell types have been predicted based on marker genes, and we will use them as categorical cell annotations.

In essence, we will perform the same steps as in Case study A:

  1. determine k
  2. perform clustering and generate dendrogram
  3. inspect the dendrogram structure by annotating it with categorical and quantitative properties of the cells
  4. criticize the results, and if necessary modify k and go back to 2.

The raw gene expressions have already been preprocessed with Seurat. SCT transformation was used for normalization, followed by principal component analysis for dimensionality reduction. 50 lower-space features have been extracted and used to generate a 2D UMAP.

Lets load the data.

# To get the data used in this case study do the following steps:
# 1. download reference data from vignette:
# ttps://satijalab.org/seurat/articles/multimodal_reference_mapping.html
# 2. load SeuratDistk
# library(SeuratDisk, lib.loc = lib.loc)
# 3. subset data at time=0
# d_t0 <- subset(x = d, subset = time == 0)
# d <- LoadH5Seurat("~/BubbleMap/raw_data/Hao_2021/pbmc_multimodal.h5seurat")
# save(d_t0, file = "Hao_2021_t0.RData")
d_t0 <- get(load(file = "raw_data/Hao_2021_t0.RData"))

We can show the number of cells per donor and predicted cell type at the lowest resolution (l1)

table(d_t0@meta.data$celltype.l1,
      d_t0@meta.data$donor)
FALSE          
FALSE             P1   P2   P3   P4   P5   P6   P7   P8
FALSE   B        240  470  373  410  544  732  987  394
FALSE   CD4 T   2198 1921 1422 1524  897 1062 1259 2052
FALSE   CD8 T   1155 1140  527  913 1160  356 1223 1648
FALSE   DC       128   93   72  114  216   95  384  123
FALSE   Mono    1914 1330 1188 1749 2777 2773 3792 3193
FALSE   NK       495  555  891  119  865  796  998 1075
FALSE   other    219   92   67   58  221  162   92  101
FALSE   other T   94  377  158  420  340  117  174  330

Next we will show the 2D UMAP. Cells are color coded according to their cell type predictions resolution level l1 (left UMAP) and l2 (right UMAP):

# Lets show the generated 2D UMAP
UMAPPlot(object = d_t0, 
         reduction = "umap", 
         group.by = 'celltype.l1', 
         pt.size = 0.25)|
  UMAPPlot(object = d_t0, 
         reduction = "umap", 
         group.by = 'celltype.l2', 
         pt.size = 0.25)

Some challenges

Now that the UMAP shows more than 50,000 cells we start to see the challenges associated with this approach:

  1. massive overplotting

    • limited visibility: the cells (points) form blobs of cells \(\rightarrow\) some points are covered by others
    • even at resolution level l1 we have difficulty to distinguish between cell types. This is quite more challenging for level l2 and l3 (not shown)
    • difficult to ascertain absolute/relative frequencies of cell types
    • increasing the number of cells in the dataset, which is likely as throughput of scRNA-seq techniques improves, will make the problem even worse
  2. qualitative analysis

    • distances between cell types not quantifiable (per UMAP manual)
    • B-cells are position
  3. unstable: UMAP rotation affects interpretation

scBubbletree approach

Now lets turn to scBubbletree. As earlier, our first goal is to determine k for the clustering of matrix \(A^{n \times f}\), which represents the low- dimensional feature space of the normalized gene expression matrix. We will use the first \(f\)=20 PCA dimensions as features. We selected \(f\)=20 based on the elbow plot below, even though lower \(f\)=10 or \(f\)=15 might also be appropriate.

ElbowPlot(object = d_t0, ndims = 50, reduction = "pca")

# This is the main input of scBubbletree 
# -> matrix A with n=cells, f=features (from PCA)
A <- d_t0@reductions$pca@cell.embeddings[, 1:20]

We will once again use the function get_k for clustering. Notice the modified function parameters which will help us achieve faster execution of this large dataset:

# Determine appropriate number of clusters (k)
b <- get_k(B = 20,
           cv_clust_p = 1, # use 100% for clustering
           cv_gap_p = 0.1, # use 10% for gap stat.
           ks = 2:50,
           x = A,
           n_start = 10,
           iter_max = 50,
           cores = 20)
FALSE boot: 1  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 2  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 3  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 4  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 5  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 6  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 7  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 8  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 9  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 10  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 11  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 12  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 13  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 14  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 15  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 16  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 17  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 18  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 19  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 20  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS.

We see high silhouette values for k=6. For higher ks the silhouette index start to drop sharply until k=15, beyond which we see slow decay in silhouette values. k=6 makes biological sense as PBMCs are typically composed of several major cell types as demonstrated in the 2D UMAPs shown earlier. Using the gap statistic or WSS does not yield easily discernible k, however, both metrics appear to slowly flatten as k increases beyond 15. Hence, we can use these metrics to determine an upper bound for potentially useful k.

ggplot()+
  geom_jitter(data = b$sil_stats, aes(x = k, y = sil),
             size = 1, shape = 21, width = 0.01, height = 0)+
  ggtitle(label = "Silhouette")|ggplot()+
  geom_jitter(data = b$gap_stats, aes(x = k, y = gap),
             size = 1, shape = 21, width = 0.01, height = 0)+
  ggtitle(label = "Gap")|ggplot()+
  geom_jitter(data = b$wss_stats, aes(x = k, y = wss),
             size = 1, shape = 21, width = 0.01, height = 0)+
  ggtitle(label = "WSS")

k=6

Lets start with k=6 and use get_bubbletree_data to perform clustering followed by hierarchical clustering:

btd_k6 <- get_bubbletree_data(x = A,
                              k = 6,
                              n_start = 100,
                              iter_max = 100,
                              seed = 4321,
                              cores = 1,
                              B = 10,
                              N_eff = 100,
                              verbose = FALSE)

Next, we use get_bubbletree

btd_k6$tree

Now we see some structure (branching) in the dendrogram, i.e. bubbles 1 and 4 and 5, 3 and 2 form two distinct branches, while the smallest bubble 6 is an outgroup.

Does this make biological sense? Lets confirm this by visualizing the predicted cell type annotations:

# cell type annotations (level 1)
a <- d_t0@meta.data$celltype.l1 

# create tile plot
w <- get_annotation_tiles_char(k = btd_k6$cluster,
                               a = a,
                               tree_meta = btd_k6$tree_meta,
                               integrate_over_clusters = F,
                               round_digits = 1)

(btd_k6$tree|w$w)+patchwork::plot_layout(ncol = 2, widths = c(1, 1))

Summary of the results:

  • branch with bubble 5 and 3:

    • 100% of CD4 T-cells are found in bubble 6 and 3
    • the branch is also enriched with CD8 T-cells, NK cells, other T cells.
    • bubble 3 also contains 21% of the cells predicted as ‘other’
    • in summary: this branch contains cells related to T-cells
  • bubble 2 is related to bubbles 5 and 3:

    • this bubble is enriched with B-cells
    • it also contains 23% of DCs \(\rightarrow\) this is unusual and hence an indication that we should use higher k.
    • in summary: the branch connecting bubbles 5, 3 and 2 is enriched with lymphocytes.
  • branch with bubbles 4 and 1:

    • this branch is enriched with two related cell types, i.e. monocytes (Mono) and dendritic cells (DCs)
    • it also contains about 22% of the cells predicted as ‘other’
  • outgroup bubble 6 is enriched with cells predicted as ‘other’

k=15

It is apparent from the earlier results generated with k=6 that some bubbles are not pure, i.e. contain cell from different cell types. For instance, bubble 3 contains both CD4, CD8 and other T-cells; bubble 2 contains B-cells and dendritic cells. Lets try increasing k to 15, the upper bound of useful ks determined earlier.

Lets perform clustering with k=15.

# do clustering
btd_k15 <- get_bubbletree_data(x = A,
                              k = 15,
                              n_start = 100,
                              iter_max = 100,
                              seed = 4123,
                              cores = 1,
                              B = 10,
                              N_eff = 100,
                              verbose = FALSE)

Before looking at the results lets compare the Gini impurity index between the k=6 and k=15 clustering. For this we will use as labels the l1 cell type predictions

FALSE [1] 0.3334136
FALSE [1] 0.2125881

The effect of this is finer dendrogram structure and branches. For instance, NK cells (bubble 8) are now separated from CD4, CD8 and other T-cells, however, are found in the same branch. B-cells are found in bubble 14 and this bubble no longer contains DCs, which not appear to be part of bubble 13 and also bubble 9.

We also see finer structure in the branch of monocytes and DCs. By considering l1 it is difficult to justify this structure as all bubbles in this branch appear to have similar annotation profiles.

# cell type annotations (level 1)
a <- d_t0@meta.data$celltype.l1 

# create tile plot
w <- get_annotation_tiles_char(k = btd_k15$cluster,
                               a = a,
                               tree_meta = btd_k15$tree_meta,
                               integrate_over_clusters = F,
                               round_digits = 1)

(btd_k15$tree|w$w)+patchwork::plot_layout(ncol = 2, widths = c(1, 1))

Lets “zoom-in” by using l2 cell type annotation to better understand the structure of this branch. l2 contains 31 cell types compared to the 8 cell types in l1. We show the l2 annotations below. We can see branching of the monocytes in CD14+ and CD16+ and also monocytes. Nevertheless, even based on the l2 cell type annotation the bubbles 7, 5 and 12 are quite similar, which indicates potential oversegmentation. At the same time, bubble 14 contains both naive, memory and intermediate B-cells, suggesting potential undersegmentation. Hence, depending on the cell type resolution of interest we may a) use higher k value or b) perform clustering of cells within specific bubbles.

# cell type annotations (level 2)
a <- d_t0@meta.data$celltype.l2

# create tile plot
w <- get_annotation_tiles_char(k = btd_k15$cluster,
                               a = a,
                               tree_meta = btd_k15$tree_meta,
                               integrate_over_clusters = F,
                               round_digits = 0)

(btd_k15$tree|w$w)+patchwork::plot_layout(ncol = 2, widths = c(2, 9))

Deeper clustering of specific bubbles is available by using the function update_bubbletree_data. This function should be used to explore the data in order to select an appropriate k and to explain the results.

Let us update cluster bubble 14 into three subclusters now and bubble 11 into two . scBubbletree will update the dendrogram distances automatically. Notice that the two newly created bubbles 14_1, 14_2 and 14_3 partition the B-cells into subclusters of naive, memory and intermediate. The newly created bubble 11_1 is composed of plasmablasts and hematopoietic stem and progenitor cells (HSPC), and 11_2 contains the remaining cells from the original bubble 11.

Also notice that the branch of lymphocytes now has two pronounced sub-branches; one made up predominantly of T-cell-like cells (including NK-cells) and the other of B-cell-like cells. The plasmablasts and HSPCs are shown as an outgroup, which is also biologically sensible.

u_btd_k15 <- update_bubbletree_data(btd = btd_k15,
                                    updated_bubbles = c("14", "11"),
                                    ks = c(3, 2),
                                    cores = 20)
FALSE Updating bubble: 14 
FALSE Updating bubble: 11 
FALSE Updating dendrogram
# create tile plot
w <- get_annotation_tiles_char(k = u_btd_k15$btd$cluster,
                               a = a,
                               tree_meta = u_btd_k15$tree_meta,
                               integrate_over_clusters = F,
                               round_digits = 0)

(u_btd_k15$tree|w$w)+patchwork::plot_layout(ncol = 2, widths = c(2, 9))

Gini index

Can we show quantitatively if increasing k will contribute to “better” clustering? Yes, we can use the Gini impurity index.

For this we will integrate the results obtained by function get_k with l1 and l2 cell type predictions (labels), and show quantitatively the change in Gini index as a function of k. This is done with the function get_gini_boot.

gini_boot_lv1 <- get_gini_boot(labels = d_t0@meta.data$celltype.l1,
                               kmeans_boot_obj = b)

gini_boot_lv2 <- get_gini_boot(labels = d_t0@meta.data$celltype.l2,
                               kmeans_boot_obj = b)
ggplot(data = gini_boot_lv1$total_gini)+
  geom_jitter(aes(x = k, y = total_gini),
              size = 1, shape = 21, width = 0.1, height = 0)+
  ggtitle(label = "labels=lv1")|
  ggplot(data = gini_boot_lv2$total_gini)+
  geom_jitter(aes(x = k, y = total_gini),
              size = 1, shape = 21, width = 0.1, height = 0)+
  ggtitle(label = "labels=lv2")

We see shallow decrease in Gini index for k>15in the context of l1 or l2 annotations.

Quantiative annotations

Gene expression annotations can also be integrated to better explain the clustering structure of the dendrogram. Lets integrate some marker genes:

  • GNLY, NKG7: NK cells
  • IL7R: CD4 T cells
  • CD8A: CD8 T cells
  • MS4A1: B cells
  • CD14, LYZ: CD14+ Monocytes
  • FCGR3A, MS4A7: FCGR3A+ Monocytes
  • FCER1A, CST3: Dendritic Cells
  • PPBP: Megakaryocytes
# First we need to select gene expressions for each cell and 
# also for five marker genes
as <- as.matrix(t(d_t0@assays$SCT@data[
  rownames(d_t0@assays$SCT@data) %in% 
    c("IL7R", 
      "CD14", "LYZ", 
      "MS4A1", 
      "CD8A", 
      "GNLY", "NKG7",
      "FCGR3A", "MS4A7",
      "FCER1A", "CST3",
      "PPBP"), ]))

# 'as' is a matrix with n=rows for cells and a=columns for 
# annotations (genes). The column names will be shown in
# the plot.
# This function build the nummeric annotations plot
w1 <- get_annotation_tiles_num(k = u_btd_k15$btd$cluster,
                               as = as,
                               tree_meta = u_btd_k15$tree_meta,
                               plot_title = "",
                               round_digits = 1)

# Plot
(u_btd_k15$tree|w1$w)+patchwork::plot_layout(widths = c(1, 3))

Second, we can visualize the distribution of each marker gene in each bubble using violin plots with get_annotation_violins. This function uses the same input as get_annotation_tiles_num.

Lets invoke this function now.

w2 <- get_annotation_violins(k = u_btd_k15$btd$cluster,
                             as = as,
                             tree_meta = u_btd_k15$tree_meta,
                             plot_title = "",
                             scales = 'free_x')

((u_btd_k15$tree|w1$w)+patchwork::plot_layout(widths = c(1.5, 3)))/w2$w

Case study C 4

Data

In the final case study we will show how to use scBubbletree to perform joint visualization of scRNA-seq and scAIRR-seq (immune profiling) data. This datasets consists of about human 10,000 PBMCs of healthy female donor obtained from 10x Genomics. Sequencing of TCRs and BCRs was done for the different B- and T-cells in the sample.

For each cell we have:

  • log-normalized gene expression
  • predicted cell type (with R package singleR)
  • BCR IGH information (IGHV, IGHJ, IGHD and IGHC gene calls, CDR3 length)

Data processing and transformation is shown below:

The 2D UMAP with cells color-coded according to their predicted cell types are shown below:

# This is the main input of scBubbletree -> matrix A
A <- d@reductions$pca@cell.embeddings

# A has n=cells as rows, f=features as columns (e.g. from PCA)
dim(A)
FALSE [1] 9391   20
b <- get_k(B = 20, 
           cv_clust_p = 1, 
           cv_gap_p = 1, 
           ks = seq(from = 2, to = 30, by = 2),
           x = A,
           n_start = 10, 
           iter_max = 50, 
           cores = 20)
FALSE boot: 1  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 2  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 3  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 4  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 5  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 6  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 7  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 8  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 9  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 10  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 11  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 12  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 13  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 14  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 15  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 16  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 17  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 18  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 19  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS. 
FALSE boot: 20  : 1) clustering, 2) silhouette, 3) gap-stat, 4) WSS.

From these plots k=15 seems like a reasonable choice, which is similar to the k seen in Case Study B (also analyzing PBMCs).

Lets perform clustering with k=15 using scBubbletree.

# Perform clustering to get data for scBubbletree
btd_k15 <- get_bubbletree_data(x = A,
                              k = 15,
                              seed = 1234,
                              cores = 1,
                              B = 20,
                              N_eff = 100,
                              verbose = F)

Lets now plot the dendrogram:

btd_k15$tree

as <- as.matrix(t(d@assays$RNA@data[
  rownames(d@assays$RNA@data) %in% 
    c("IL7R", 
      "CD14", "LYZ", 
      "MS4A1", 
      "CD8A", 
      "GNLY", "NKG7",
      "FCGR3A", "MS4A7",
      "FCER1A", "CST3",
      "PPBP"), ]))

w0 <- get_annotation_tiles_char(k = btd_k15$cluster,
                               a = meta$pred_cell_type,
                               tree_meta = btd_k15$tree_meta,
                               integrate_over_clusters = F,
                               round_digits = 1)

w1 <- get_annotation_tiles_num(k = btd_k15$cluster,
                               as = as,
                               tree_meta = btd_k15$tree_meta,
                               plot_title = "",
                               round_digits = 1)

w2 <- get_annotation_tiles_char(k = btd_k15$cluster,
                               a = meta[, c("has_IGH")],
                               tree_meta = btd_k15$tree_meta,
                               integrate_over_clusters = F,
                               round_digits = 1)

a <- matrix(data = meta[, c("cdr3_length")],
       ncol = 1)
colnames(a) <- "CDR3 length [aa]"

w3 <- get_annotation_violins(k = btd_k15$cluster,
                             a = a,
                             tree_meta = btd_k15$tree_meta,
                             violin_min_cells = 5)

w4 <- get_annotation_tiles_char(k = btd_k15$cluster,
                               a = meta[, c("v_gene")],
                               tree_meta = btd_k15$tree_meta,
                               round_digits = 0,
                               integrate_over_clusters = T)

w5 <- get_annotation_tiles_char(k = btd_k15$cluster,
                               a = meta[, c("j_gene")],
                               tree_meta = btd_k15$tree_meta,
                               round_digits = 0,
                               integrate_over_clusters = T)

w6 <- get_annotation_tiles_char(k = btd_k15$cluster,
                               a = meta[, c("c_gene")],
                               tree_meta = btd_k15$tree_meta,
                               round_digits = 0,
                               integrate_over_clusters = T)



((btd_k15$tree|w0$w|w1$w)+patchwork::plot_layout(widths = c(4, 3, 3)))/
  ((w2$w|w3$w|w4$w)+patchwork::plot_layout(widths = c(2.5, 1.5, 6)))/
  ((w5$w|w6$w)+patchwork::plot_layout(widths = c(1, 1)))

References


  1. Tian, Luyi, et al. “Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments.” Nature methods 16.6 (2019): 479-487.↩︎

  2. Hao, Yuhan, et al. “Integrated analysis of multimodal single-cell data.” Cell 184.13 (2021): 3573-3587.↩︎

  3. https://satijalab.org/seurat/articles/multimodal_reference_mapping.html↩︎

  4. https://www.10xgenomics.com/resources/datasets/10-k-human-pbm-cs-5-v-2-0-chromium-x-2-standard-6-1-0↩︎